Educative: Interactive Courses for Software Developers

In the previous section, we identified an anomaly. We found an increase in the 400 status code because the z-score was 6. But how do we set the threshold for the z-score? Is a z-score of 3 an anomaly? What about 2, or 1?

To find thresholds that fit our needs, we can run simulations on past data with different values and evaluate the results. This is often called backtesting.

The first thing we need to do is to calculate the mean and the standard deviation for each status code up until every row, just as if it is the current value. This is a classic job for a window function:

WITH calculations_over_window AS (
   SELECT
      status_code,
      period,
      entries,
      AVG(entries) OVER status_window as mean_entries,
      STDDEV(entries) OVER status_window as stddev_entries
   FROM
      server_log_summary
   WINDOW status_window AS (
      PARTITION BY status_code
      ORDER BY period
      ROWS BETWEEN 60 PRECEDING AND CURRENT ROW
   )
)
SELECT *
FROM calculations_over_window
ORDER BY period DESC
LIMIT 20;

Enter to Rename, Shift+Enter to Preview

We use a window function to calculate the mean and standard deviation over a sliding window of 60 minutes. To avoid repeating the WINDOW clause for every aggregate, we define a named window called “status_window.” This is another nice feature of PostgreSQL.

In the results, we can now see that we have the mean and standard deviation of the previous 60 rows for every entry. This is similar to the calculation we did in the previous section except this time we’ll do it for every row.

Now we can calculate the z-score for every row:

WITH calculations_over_window AS (
   SELECT
      status_code,
      period,
      entries,
      AVG(entries) OVER status_window as mean_entries,
      STDDEV(entries) OVER status_window as stddev_entries
   FROM
      server_log_summary
   WINDOW status_window AS (
      PARTITION BY status_code
      ORDER BY period
      ROWS BETWEEN 60 PRECEDING AND CURRENT ROW
   )
),
with_zscore AS (
   SELECT
       *,
       (entries - mean_entries) / NULLIF(stddev_entries::float, 0) as zscore
   FROM
       calculations_over_window
)
SELECT
   status_code,
   period,
   zscore
FROM
   with_zscore
ORDER BY

Enter to Rename, Shift+Enter to Preview

We now have z-scores for every row, and we can try to identify anomalies:

WITH calculations_over_window AS (
   SELECT
       status_code,
       period,
       entries,
       AVG(entries) OVER status_window as mean_entries,
       STDDEV(entries) OVER status_window as stddev_entries
   FROM
       server_log_summary
   WINDOW status_window AS (
       PARTITION BY status_code
       ORDER BY period
       ROWS BETWEEN 60 PRECEDING AND CURRENT ROW
   )
),
with_zscore AS (
   SELECT
       *,
       (entries - mean_entries) / NULLIF(stddev_entries::float, 0) as zscore
   FROM
       calculations_over_window
),
with_alert AS (
   SELECT
       *,
       zscore > 3 AS alert
   FROM
       with_zscore

Enter to Rename, Shift+Enter to Preview

We decided to classify values with a z-score greater than 3 as anomalies.You’ll often find that 3 is the magic number in textbooks, but do not get sentimental about it, because you can definitely change it to get better results.

Detecting Anomalies

Analyzing A Server Log

Backtesting

Improving Accuracy

Conclusion

Finding Past Anomalies